A Distribution-Free Test for Model Comparisons



1. Introduction 

Consider a finite set of m mathematical models which have each provided
estimates of subject data at n trial points. A general problem of model
comparisons is concerned with deciding if any one model is a better "fit"
of the data than the other models when there is no universally accepted
yardstick of "fit" or standard statistical test [Bush and Mosteller, 1959]. The
most common approach to the problem is to compare each model to the data using
some $\chi^2$-like procedure. The basic assumption involved therein is one of
independence across trials (or blocks of trials) in order to satisfy the
additivity of the statistical test model. However, if the models are in any
way path-dependent and it is expected that the fit of the models is a function
of n, then the use of such comparison tests is inappropriate since some trials
would be more important than others for model comparison purposes. Tests of
the Kolmogorov-Smirnov and Cramér-von Mises type (e.g., Birnbaum, 1953; Darling,
1957; Massey, 1951) and others (e.g., Anderson & Darling, 1954; Riedwyl, 1967;
Tsao, 1955) require a continuous cumulative distribution function for the
random variable which accounts for the data. Atkinson (1969) presents several
tests for model comparisons which measure the deviations of each model's
predictions from some "best" formula which is found by regression.

This paper proposes a distribution-free index for model comparisons which
makes no assumptions about continuity. The index also allows for differential
weighting of trials according to the effect of each trial on the data. For
example, suppose one is comparing several learning models and it can be assumed
that each model's proximity (as measured by some definition of proximity) to the
data is a monotonically increasing function of n. Then the trials over which
the models are compared should be differentially weighted by giving heavier
weights to the later trials, for it is in these trials that the models are
expected to provide better estimates of the data points.

Let the measure of closeness of model j at trial i be defined as

        $f_{ij} = |\hat{y}_{ij} - y_i|$

where $\hat{y}_{ij}$ is the estimate of the data value $y_i$ given by model j at trial i. It
is apparent that if model j is a closer fit of the data than the other m-1 models,
then $f_{ij}$ values will generally be smaller than $f_{ik}$ values, $k \neq j$ (and vice versa
for a poorer fit).

The goodness-of-fit index, defined in Section 2, assigns positive integer
ranks, $r_{ij}$, to the $f_{ij}$ for all i and j. Then weights, $w_i$, are assigned to the
comparison points based on theoretical or empirical considerations as to the
importance of each trial for comparison purposes. Some properties, including
the conditions on the $w_i$ for asymptotic normality of the index distribution, are
given. In Section 3, a discussion of large sample single-model and simultaneous
inference is presented. Section 4 suggests several possible permissible
weighting schemes and Section 5 illustrates the procedure with three probability
learning models.



2. The Index and Its Properties

Let the index, denoted $I_j$, be defined for the jth model as

(1)     $I_j = \sum_{i=1}^{n} p_i \, \frac{m - r_{ij}}{m - 1}$

where m = number of models under consideration,

      $r_{ij}$ = rank of $f_{ij}$ for the jth model at trial i,

      $p_i = w_i \Big/ \sum_{k=1}^{n} w_k$, and

      $w_i$ = weight assigned to trial i.
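
As a minimal sketch of definition (1), the following Python computes the index for every model from a table of estimates. The function names, the use of absolute deviation as the closeness measure, the average-rank treatment of ties, and the data values are all illustrative assumptions, not part of the procedure as stated.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def fit_index(estimates, data, weights):
    """Return [I_1, ..., I_m]; estimates[i][j] is model j's estimate at trial i."""
    n, m = len(data), len(estimates[0])
    total = sum(weights)
    index = [0.0] * m
    for i in range(n):
        # Assumed closeness measure: absolute deviation from the data value.
        closeness = [abs(estimates[i][j] - data[i]) for j in range(m)]
        ranks = average_ranks(closeness)
        p_i = weights[i] / total
        for j in range(m):
            index[j] += p_i * (m - ranks[j]) / (m - 1)
    return index


# Hypothetical data: 4 trials, 3 models, weights w_i = i.
data = [0.6, 0.7, 0.7, 0.8]
estimates = [
    [0.50, 0.65, 0.90],
    [0.60, 0.72, 0.40],
    [0.75, 0.69, 0.30],
    [0.70, 0.79, 0.20],
]
print(fit_index(estimates, data, weights=[1, 2, 3, 4]))
```

In this made-up case the second model is closest at every trial and the third is farthest at every trial, so the two extremes of the index are both exercised.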



Some Properties of $I_j$:

Proofs of these properties have been omitted for the sake of brevity.

a)
(2)     $0 \le I_j \le 1$.

b) If $r_{ij} = k$ for all i, then

(3)     $I_j = \frac{m - k}{m - 1}$.

c) The maximum non-perfect ($I_j \neq 1$) value of $I_j$ will occur when
model j is ranked 2 for the data point having the smallest weight and ranked 1
for all other data points. If weight $w_t$ is given the t-th data point, then the
maximum non-perfect $I_j$ value is given by

(4)     $1 - \frac{\min_t w_t}{(m - 1) \sum_{i=1}^{n} w_i}$.

d)
(5)     $\sum_{j=1}^{m} I_j = \frac{m}{2}$.
e) If it is assumed that model j fits the data no better than the
other models over all trials, then

(6)     $E(I_j) = 1/2$.

f) If the rank value of model j at trial i is independent of the
rank value of model j at trial i+k (k > 0) and model j fits the data no better
than any other model, i.e., $P(r_{ij}) = 1/m$ for all i, then

(7)     $\mathrm{VAR}(I_j) = \left( \sum_{i=1}^{n} p_i^2 \right) \frac{m + 1}{12(m - 1)}$.
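
Since the proofs are omitted, a short derivation of (6) and (7) may help. It is a sketch that assumes untied ranks, each uniform on 1, ..., m and independent across trials, as in property f).

```latex
% Assumes ranks are untied and uniform on {1, ..., m} under the null hypothesis.
E(r_{ij}) = \frac{m+1}{2}, \qquad \mathrm{VAR}(r_{ij}) = \frac{m^2 - 1}{12}

E\!\left(\frac{m - r_{ij}}{m - 1}\right) = \frac{m - \frac{m+1}{2}}{m - 1} = \frac{1}{2}

\mathrm{VAR}\!\left(\frac{m - r_{ij}}{m - 1}\right) = \frac{m^2 - 1}{12(m - 1)^2} = \frac{m + 1}{12(m - 1)}

\mathrm{VAR}(I_j) = \sum_{i=1}^{n} p_i^2 \,\mathrm{VAR}\!\left(\frac{m - r_{ij}}{m - 1}\right)
                 = \left(\sum_{i=1}^{n} p_i^2\right) \frac{m + 1}{12(m - 1)}
```

The mean follows because the weights $p_i$ sum to one, and the variance uses independence across trials so that the cross terms vanish.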



g) A direct application of the Lindeberg-Feller Theorem [Gnedenko
and Kolmogorov, 1954] shows that if

(8)

then $I_j$ is asymptotically normally distributed. The converse can also be shown
to hold.

3. Significance Tests with Ij 



The implication of property g) in Section 2 is that, coupled with the
independence assumption of property f), we can, for sizable n, use $I_j$ to conduct
a test of

(9)     $H_0: P(f_{ij} > f_{ik}) = P(f_{ij} < f_{ik})$,  $j \neq k$, j fixed,

which is distribution-free under $H_0$. Note that this $H_0$ is equivalent to

        $H_0: P(r_{ij}) = P(r_{ik})$ for all i, $j \neq k$.

Since, under $H_0$, $I_j$ is asymptotically normally distributed with mean and
variance given respectively by (6) and (7), the statistic

        $z = \frac{I_j - E(I_j)}{\sqrt{\mathrm{VAR}(I_j)}}$

is approximately distributed N(0,1).

When confronted with more than one hypothesis, for example, when j is not
fixed in (9), a device for scaling down the significance level can be used. One
such device, resulting from the Bonferroni Inequality [Miller, 1966], suggests the
$\alpha/2m$ level of significance for simultaneous two-tailed tests. Crude though this
estimated significance level is, its derivation does not depend on the $I_j$, j = 1,
2, ..., m, being independent as do most simultaneous test approximations.
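
The large-sample test can be sketched in a few lines of Python. The function names are hypothetical, the Bonferroni level is read as $\alpha/(2m)$ per tail, and the inputs (n = 10, m = 3, weights $w_i = i$, index value .80) are taken from the Section 5 example only for illustration.

```python
import math


def null_variance(weights, m):
    """Variance of I_j under the null hypothesis, per equation (7)."""
    total = sum(weights)
    sum_p_sq = sum((w / total) ** 2 for w in weights)
    return sum_p_sq * (m + 1) / (12 * (m - 1))


def z_statistic(index_value, weights, m):
    """Standardize I_j using E(I_j) = 1/2 and the null variance."""
    return (index_value - 0.5) / math.sqrt(null_variance(weights, m))


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def bonferroni_level(alpha, m):
    """Per-tail level alpha / (2m) for m simultaneous two-tailed tests."""
    return alpha / (2 * m)


weights = list(range(1, 11))          # w_i = i for n = 10 trials
z = z_statistic(0.80, weights, m=3)   # index value from the Section 5 example
p = upper_tail(z)
print(round(z, 3), round(p, 4), round(bonferroni_level(0.05, 3), 4))
```

Comparing the tail probability with the Bonferroni level rather than with $\alpha$ itself is what makes the simultaneous test more conservative than the single-model test.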



4. Some Weighting Functions 

The basic subjective portion in the development and use of index $I_j$ is the
assignment of weights $w_i$, i = 1, 2, ..., n, to the trials. This will depend on
the relative importance which the experimenter places on the trials used for the
comparisons of the models and can take on almost any functional form. There are,
however, several which (a) satisfy the condition in expression (8) for asymptotic
normality, (b) are rational, and (c) possess mathematical simplicity. Three
such weighting functions and their resulting variances are herein presented.



Function 1: $w_i = c$, $c \neq 0$.

The effect here is one of proposing that the trials are all equal for
comparison purposes. This would be the case if an experimenter assumed random
behavior models and did not expect the models to be better fits toward the end
of the trials than at the beginning of the trials or vice versa.

Here the variance of $I_j$ under the assumptions of property f) becomes

(10)     $\mathrm{VAR}(I_j) = \frac{m + 1}{12n(m - 1)}$.



Function 2: $w_i = i$.

For this case the assumption is that the later comparison trials are
more important than the earlier trials. This would be appropriate if an
experimenter felt that the models required the earlier trials to sequentially
reach a point beyond which the comparisons with data would be reasonable.

Under this scheme and the assumptions of property f),

(11)     $\mathrm{VAR}(I_j) = \frac{(2n + 1)(m + 1)}{18n(n + 1)(m - 1)}$.



Function 3: $w_i = n - i + 1$.

The assumption for this scheme is that the earlier trials are more
important for model comparison purposes than the later trials and would be
appropriate to use if one believed that some kind of "fatigue" factor was
involved. For example, suppose it was suspected that beyond trial k, the
behavior being modeled gradually began to act in a random or erratic fashion.
Then the later trials could be thought of as "unreliable" for model comparisons.
The variance here is the same as under Function 2 except for subsets of trials
in which the summation of (1) does not run over the full range from 1 to n.
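
The closed forms (10) and (11) can be checked against the general variance (7) with exact rational arithmetic. The sketch below is only a verification aid; the helper names and the choice n = 10, m = 3 are assumptions made for illustration.

```python
from fractions import Fraction


def null_variance(weights, m):
    """Equation (7): (sum of p_i^2) * (m + 1) / (12 (m - 1)), in exact arithmetic."""
    total = sum(weights)
    sum_p_sq = sum(Fraction(w, total) ** 2 for w in weights)
    return sum_p_sq * Fraction(m + 1, 12 * (m - 1))


def equal_weights(n, c=1):
    return [c] * n                      # Function 1: w_i = c


def increasing_weights(n):
    return list(range(1, n + 1))        # Function 2: w_i = i


def decreasing_weights(n):
    return list(range(n, 0, -1))        # Function 3: w_i = n - i + 1


n, m = 10, 3
var1 = null_variance(equal_weights(n), m)
var2 = null_variance(increasing_weights(n), m)
var3 = null_variance(decreasing_weights(n), m)

closed_form_1 = Fraction(m + 1, 12 * n * (m - 1))                            # (10)
closed_form_2 = Fraction((2 * n + 1) * (m + 1), 18 * n * (n + 1) * (m - 1))  # (11)
print(var1 == closed_form_1, var2 == closed_form_2, var3 == var2)
# → True True True
```

Functions 2 and 3 give the same variance over the full range of trials because they use the same set of weights in reverse order, and (7) depends only on the sum of the squared $p_i$.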




5. Example 

Three probability learning models were compared over the last 10 trials
of a two-choice experiment. In this experiment, human subjects were asked to
predict which of two possible events, $E_1$ or $E_2$, would occur on each of a series
of trials. Predictions of $E_1$ and $E_2$ are denoted by responses $A_1$ and $A_2$,
respectively. At the end of each trial, the subjects were permitted to observe
which event actually occurred. Event $E_1$ had a fixed probability $\pi = .7$ of
occurring in a random sequence.

Model 1, a linear operator model, is due to Estes (1950) and in this
experiment assumed the form

(12)     $\Pr(A_1 \text{ on trial } n+1) = P_{1,n+1} = \pi - [\pi - P_{1,1}](1 - \theta)^{n-1}$

where $\theta$ is a rate of learning parameter, $0 < \theta < 1$, estimated from observed
response frequencies. This model is usually applied to experiments with many
more trials than the experiment of this example, but is included here for
illustrative purposes only.

Models 2 and 3 are of the form

(13)     $P_{1,n+1} = \mathbf{e} \cdot \boldsymbol{\mu}$

where $\mathbf{e}$ is an experience vector of length n representing the trials 1, 2, ..., n
and composed of the digits one or zero depending on whether or not $E_1$ occurred
on a particular trial, and $\boldsymbol{\mu}$ is a memory vector whose n elements are
proportional to the probabilities of recalling the events of trials 1 to n such that
$\sum_{i=1}^{n} \mu_i = 1$. The development of the basic theory relative to (13) is due to
Overall (1960).




Models 2 and 3 differ in the estimation of elements in the memory vector,
$\boldsymbol{\mu}$. Model 2 employed probabilities which were empirically determined by
Murdock (1962) under a variety of recall conditions, none of which involved
binary items. The probability of recalling the i-th item in a sequence of length
n was

(14)     $P(i, n) = 1.00 + .27e^{-.77(i-1)} - .772(.042)^{.555^{(n-i)}}$.

The probabilities used in Model 3 were estimated from the data of a
previous experiment where subjects engaged in the two-choice task made recalls
at periodic intervals. These probabilities were

(15)     $P(i, n) = .9047 - .0694(i) + .0775(i^2/n) - .3577(i/n)^2$.

Thirty-eight subjects performed the two-choice experiment with $\pi = .7$.
Table 1 presents the results of the last 10 trials.

Suppose that Model 3 is of particular interest. If a test of

        $H_0: P(f_{i3} > f_{ik}) = P(f_{i3} < f_{ik})$,  i = 1, 2, ..., 10,  k = 1, 2,

is conducted at $\alpha = .05$ under the weights of Function 2, $I_3 = .8000$ and
$z_{I_3} = \frac{I_3 - .5}{.1457} = 2.059$. Since $P(z > 2.059) \approx .02$, $H_0$ is rejected and Model 3
can be judged to be a significantly better fit than the other models. However,
with the weights of Function 1, $I_3 = .7500$ and $z_{I_3} = \frac{I_3 - .5}{.4083} = .637$, so that
$P(z > .637) \approx .26$. Thus $H_0$ cannot be rejected. Models 1 and 2 are not
significant under either weighting function since their index values are .4545 and
.2455 respectively using Function 2, and .45 and .30 respectively with Function
1. The Bonferroni test was not significant at the .05 level using Functions 1
or 2.
COMPARISON OF THREE PROBABILITY LEARNING MODELS

6. Comments and Discussion

In any procedure involving assigned ranks the problem of tied rankings
merits discussion. We have purposely avoided the issue since the standard
procedures of randomly breaking ties or assigning average ranks are quite
satisfactory for small numbers of ties. If the number of ties is large, say
> 20%, then we recommend that $I_j$ not be used as an index of goodness-of-fit.
No adjustments are herein proposed to account for the number of ties in
calculating $I_j$.

The tests of significance presented in Section 3 were exclusively for
"large" sample sizes, and we propose that $n \ge 10$ is sufficient to warrant the
use of the z statistic for testing purposes. Figure 1 graphically illustrates
the exact distribution of $I_j$ under $H_0$ for n = 10, m = 2, and $w_i = i$,
i = 1, 2, ..., 10. It appears that the normal distribution would yield quite
reasonable approximations. A more practical reason for not deriving and
providing exact distributions for n < 10 is purely financial. The reader will
appreciate the massive computing job necessary to provide these distributions
for m > 2, and the need for such small-sample distributions (since the number of
trials usually exceeds 10) does not justify the expenditure at present. Only
if one were estimating a small number of parameters by several models would
exact small-sample distributions be desirable.
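
For the case n = 10, m = 2, $w_i = i$ the exact null distribution is small enough to enumerate directly. The following sketch does so and compares one upper-tail probability with its normal approximation; the variable names and the cut-off value .80 are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import math

n, m = 10, 2
weights = list(range(1, n + 1))         # w_i = i
total = sum(weights)

# With m = 2, (m - r_ij)/(m - 1) is 1 for rank 1 and 0 for rank 2, each with
# probability 1/2 under H0, so all 2^n rank patterns are equally likely.
counts = Counter()
for pattern in product((0, 1), repeat=n):
    score = sum(w for w, best in zip(weights, pattern) if best)
    counts[Fraction(score, total)] += 1

prob = {value: Fraction(c, 2 ** n) for value, c in counts.items()}
mean = sum(v * p for v, p in prob.items())
variance = sum((v - mean) ** 2 * p for v, p in prob.items())


def exact_upper_tail(threshold):
    """Exact P(I_j >= threshold) under H0."""
    return sum(p for v, p in prob.items() if v >= threshold)


def normal_upper_tail(threshold):
    """Normal approximation to the same tail, without continuity correction."""
    z = (float(threshold) - 0.5) / math.sqrt(float(variance))
    return 0.5 * math.erfc(z / math.sqrt(2))


cut = Fraction(44, 55)                  # I_j = .80
print(float(mean), float(variance))
print(float(exact_upper_tail(cut)), normal_upper_tail(cut))
```

The enumerated mean and variance agree with (6) and with (11) evaluated at m = 2, which gives a direct check on the moments used by the z statistic.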

The primary concern in this paper has been with the presentation of an
index of relative goodness-of-fit rather than its accompanying tests of
significance. This was due to the restrictive assumptions underlying the
test of $H_0$. The testable hypothesis itself may be so general in form that
a researcher would not want to test it in the first place. This, as in any
test of hypothesis, does not invalidate the use of the index with its
accompanying properties for descriptive purposes as a goodness-of-fit indicator.

The aspect of selecting weight functions is obviously a crucial one and
we have presented only a small selection of simple functions. We have also
assumed that the same weight function would be used for each model for a
particular comparison. The problem of assigning a priori a "best" function
per model and then making the comparisons was not addressed and indeed would
be a difficult problem to handle statistically.

APPENDIX

Simulated Distributions of any $I_j$ for m = 2, 3, 4, 5 over 100 Trials Assuming
$\Pr(r_{ij}) = 1/m$ for i = 1, 2, ..., 100.
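
A simulation of this kind can be sketched as follows. The weights are not specified here, so equal weights are assumed; the number of replications, the seed, and the function names are likewise hypothetical choices.

```python
import random
import statistics


def simulate_index(weights, m, rng):
    """Draw one I_j with each rank uniform on 1..m, independently across trials."""
    total = sum(weights)
    return sum(
        (w / total) * (m - rng.randint(1, m)) / (m - 1)
        for w in weights
    )


def simulate_distribution(n, m, replications=2000, seed=0):
    """Simulate the null distribution of I_j over n trials with equal weights."""
    rng = random.Random(seed)
    weights = [1] * n                   # equal weights: an assumption
    return [simulate_index(weights, m, rng) for _ in range(replications)]


for m in (2, 3, 4, 5):
    draws = simulate_distribution(n=100, m=m)
    theory = (m + 1) / (12 * 100 * (m - 1))   # equation (10) with n = 100
    print(m, round(statistics.mean(draws), 4),
          round(statistics.variance(draws), 5), round(theory, 5))
```

Each printed row compares the simulated mean and variance with the value 1/2 from (6) and the variance from (10).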